{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# ONNX Export\n",
    "\n",
    "## Requirements\n",
    "\n",
    "Brevitas requires Python 3.8+ and PyTorch 1.9.1+ and can be installed from PyPI with `pip install brevitas`. \n",
    "\n",
    "For this notebook, you will also need to install `onnx`, `onnxruntime`, `onnxoptimizer` and `netron` (for visualization of ONNX models).\n",
    "For this tutorial, PyTorch 1.8.1+ is required."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-05-09T14:49:50.018639Z",
     "iopub.status.busy": "2025-05-09T14:49:50.017028Z",
     "iopub.status.idle": "2025-05-09T14:49:57.814054Z",
     "shell.execute_reply": "2025-05-09T14:49:57.811032Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: netron in /proj/xlabs/users/nfraser/opt/miniforge3/envs/20231115_brv_pt1.13.1/lib/python3.10/site-packages (7.2.9)\r\n",
      "Note: you may need to restart the kernel to use updated packages.\n"
     ]
    }
   ],
   "source": [
    "%pip install netron"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Introduction\n",
    "\n",
    "The main goal of this notebook is to show how to use Brevitas to export your models in the two standards currently supported by ONNX for quantized models: QCDQ and QOps (i.e., `QLinearConv`, `QLinearMatMul`). Once exported, these models can be run using onnxruntime.\n",
    "\n",
    "This notebook doesn't cover QONNX, a custom extension over ONNX with more features for quantization representation that Brevitas can generate as export, which requires the `qonnx` library."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## QuantizeLinear-Clip-DeQuantizeLinear (QCDQ)\n",
    "\n",
    "QCDQ is a style of representation introduced by Brevitas that extends the standard QDQ representation for quantization in ONNX. In Q(C)DQ export, before each operation, two (or three, in case of clipping) extra ONNX nodes are added:\n",
    "- `QuantizeLinear`: Takes as input a FP tensor, and quantizes it with a given zero-point and scale factor. It returns an (U)Int8 tensor.\n",
    "- `Clip` (Optional): Takes as input an INT8 tensor, and, given ntenger min/max values, restricts its range.\n",
    "- `DeQuantizeLinear`: Takes as input an INT8 tensor, and converts it to its FP equivalent with a given zero-point and scale factor.\n",
    "\n",
    "There are several implications associated with this set of operations:\n",
    "- It is not possible to quantize with a bit-width higher than 8. Although `DequantizeLinear` supports both (U)Int8 and Int32 as input, currently `QuantizeLinear` can only output (U)Int8.\n",
    "- Using only `QuantizeLinear` and `DeDuantizeLinear`, it is possible only to quantize to 8 bit (signed or unsigned).\n",
    "- The addition of the `Clip` function between `QuantizeLinear` and `DeQuantizeLinear`, allows to quantize a tensor to bit-width < 8. This is done by Clipping the Int8 tensor coming out of the `QuantizeLinear` node with the min/max values of the desired bit-width (e.g., for unsigned 3 bit, `min_val = 0` and `max_val = 7`).\n",
    "- It is possible to perform both per-tensor and per-channel quantization (requires ONNX Opset >=13).\n",
    "\n",
    "We will go through all these cases with some examples.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "### Basic Example\n",
    "\n",
    "First, we will look at `brevitas.nn.QuantLinear`, a quantized alternative to `torch.nn.Linear`. Similar considerations can also be used for  `QuantConv1d`, `QuantConv2d`,  `QuantConvTranspose1d` and `QuantConvTranspose2d`.\n",
    "\n",
    "Brevitas offers several API to export Pytorch modules into several different formats, all sharing the same interface.\n",
    "The three required arguments are:\n",
    "- The PyTorch model to export\n",
    "- A representative input tensor (or a tuple of input args)\n",
    "- The path where to save the exported model\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2025-05-09T14:49:57.828448Z",
     "iopub.status.busy": "2025-05-09T14:49:57.827713Z",
     "iopub.status.idle": "2025-05-09T14:49:57.855085Z",
     "shell.execute_reply": "2025-05-09T14:49:57.852844Z"
    },
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "import netron\n",
    "import time\n",
    "from IPython.display import IFrame\n",
    "\n",
    "# helpers\n",
    "def assert_with_message(condition):\n",
    "    assert condition\n",
    "    print(condition)\n",
    "\n",
    "def show_netron(model_path, port):\n",
    "    time.sleep(3.)\n",
    "    netron.start(model_path, address=(\"localhost\", port), browse=False)\n",
    "    return IFrame(src=f\"http://localhost:{port}/\", width=\"100%\", height=400)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2025-05-09T14:49:57.863763Z",
     "iopub.status.busy": "2025-05-09T14:49:57.863068Z",
     "iopub.status.idle": "2025-05-09T14:50:06.008930Z",
     "shell.execute_reply": "2025-05-09T14:50:06.006760Z"
    },
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "import brevitas.nn as qnn\n",
    "import torch\n",
    "from brevitas.export import export_onnx_qcdq\n",
    "\n",
    "IN_CH = 3\n",
    "OUT_CH = 128\n",
    "BATCH_SIZE = 1\n",
    "\n",
    "# set seed\n",
    "torch.manual_seed(0)\n",
    "\n",
    "linear = qnn.QuantLinear(IN_CH, OUT_CH, bias=True)\n",
    "inp = torch.randn(BATCH_SIZE, IN_CH)\n",
    "path = 'quant_linear_qcdq.onnx'\n",
    "\n",
    "exported_model = export_onnx_qcdq(linear, args=inp, export_path=path, opset_version=13)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    },
    "tags": [
     "skip-execution"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Serving 'quant_linear_qcdq.onnx' at http://localhost:8082\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "\n",
       "        <iframe\n",
       "            width=\"100%\"\n",
       "            height=\"400\"\n",
       "            src=\"http://localhost:8082/\"\n",
       "            frameborder=\"0\"\n",
       "            allowfullscreen\n",
       "            \n",
       "        ></iframe>\n",
       "        "
      ],
      "text/plain": [
       "<IPython.lib.display.IFrame at 0x7fb62ae3fe50>"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "show_netron(path, 8082)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "As it can be seen from the exported ONNX, by default in `QuantLinear` only the weights are quantized, and they go through a Quantize/DequantizeLinear before being used for the `Gemm` operation. Moreover, there is a clipping operation that sets the min/max values for the tensor to ±127. This is because in Brevitas the default weight quantizer (but not the activation one) has the option `narrow_range=True`.\n",
    "This option, in case of signed quantization, makes sure that the quantization interval is perfectly symmetric (otherwise, the minimum integer would be -128), so that it can absorb sign changes (e.g. from batch norm fusion).\n",
    "\n",
    "The input and bias remains in floating point. In QCDQ export this is not a problem since the weights, that are quantized at 8 bit, are dequantized to floating-point before passed as input to the `Gemm` node.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "### Complete Model\n",
    "\n",
    "A similar approach can be used with entire Pytorch models, rather than single layer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2025-05-09T14:50:06.018006Z",
     "iopub.status.busy": "2025-05-09T14:50:06.017273Z",
     "iopub.status.idle": "2025-05-09T14:50:06.136026Z",
     "shell.execute_reply": "2025-05-09T14:50:06.134548Z"
    },
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "class QuantModel(torch.nn.Module):\n",
    "    def __init__(self) -> None:\n",
    "        super().__init__()\n",
    "        self.linear = qnn.QuantLinear(IN_CH, OUT_CH, bias=True, weight_scaling_per_output_channel=True)\n",
    "        self.act = qnn.QuantReLU()\n",
    "    \n",
    "    def forward(self, inp):\n",
    "        inp = self.linear(inp)\n",
    "        inp = self.act(inp)\n",
    "        return inp\n",
    "\n",
    "model = QuantModel()\n",
    "inp = torch.randn(BATCH_SIZE, IN_CH)\n",
    "path = 'quant_model_qcdq.onnx'\n",
    "\n",
    "exported_model = export_onnx_qcdq(model, args=inp, export_path=path, opset_version=13)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    },
    "tags": [
     "skip-execution"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Serving 'quant_model_qcdq.onnx' at http://localhost:8083\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "\n",
       "        <iframe\n",
       "            width=\"100%\"\n",
       "            height=\"400\"\n",
       "            src=\"http://localhost:8083/\"\n",
       "            frameborder=\"0\"\n",
       "            allowfullscreen\n",
       "            \n",
       "        ></iframe>\n",
       "        "
      ],
      "text/plain": [
       "<IPython.lib.display.IFrame at 0x7fb734383710>"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "show_netron(path, 8083)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "We did not specify the argument `output_quant` in our `QuantLinear` layer, thus the output of the layer will be passed directly to the ReLU function without any intermediate re-quantization step.\n",
    "\n",
    "Furthermore, we have defined a per-channel quantization, so the scale factor will be a Tensor rather than a scalar (ONNX opset >= 13 is required for this).\n",
    "\n",
    "Finally, since we are using a `QuantReLU` with default initialization, the output is re-quantized as an UInt8 Tensor.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "### The C in QCDQ (Bitwidth <= 8)\n",
    "\n",
    "As mentioned, Brevitas export expands on the basic QDQ format by adding the `Clip` operation.\n",
    "\n",
    "This operations is inserted between the `QuantizeLinear` and `DeQuantizeLinear` node, and thus operates on integers.\n",
    "\n",
    "Normally, using only the QDQ format, it would be impossible to export models quantize with less than 8 bit.\n",
    "\n",
    "In Brevitas however, if a quantized layer with bit-width <= 8 is exported, the Clip node will be automatically inserted, with the min/max values computed based on the particular type of quantized performed (i.e., signed vs unsigned, narrow range vs no narrow range, etc.).\n",
    "\n",
    "Even though the Tensor data type will still be a Int8 or UInt8, its values are restricted to the desired bit-width."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2025-05-09T14:50:06.143034Z",
     "iopub.status.busy": "2025-05-09T14:50:06.142638Z",
     "iopub.status.idle": "2025-05-09T14:50:06.274118Z",
     "shell.execute_reply": "2025-05-09T14:50:06.272709Z"
    },
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "class Model(torch.nn.Module):\n",
    "    def __init__(self) -> None:\n",
    "        super().__init__()\n",
    "        self.linear = qnn.QuantLinear(IN_CH, OUT_CH, bias=True, weight_bit_width=3)\n",
    "        self.act = qnn.QuantReLU(bit_width=4)\n",
    "    \n",
    "    def forward(self, inp):\n",
    "        inp = self.linear(inp)\n",
    "        inp = self.act(inp)\n",
    "        return inp\n",
    "\n",
    "model = Model()\n",
    "model.eval()\n",
    "\n",
    "inp = torch.randn(BATCH_SIZE, IN_CH)\n",
    "path = 'quant_model_3b_4b_qcdq.onnx'\n",
    "\n",
    "exported_model = export_onnx_qcdq(model, args=inp, export_path=path, opset_version=13)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    },
    "tags": [
     "skip-execution"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Serving 'quant_model_3b_4b_qcdq.onnx' at http://localhost:8084\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "\n",
       "        <iframe\n",
       "            width=\"100%\"\n",
       "            height=\"400\"\n",
       "            src=\"http://localhost:8084/\"\n",
       "            frameborder=\"0\"\n",
       "            allowfullscreen\n",
       "            \n",
       "        ></iframe>\n",
       "        "
      ],
      "text/plain": [
       "<IPython.lib.display.IFrame at 0x7fb629e8a010>"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "show_netron(path, 8084)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "As can be seen from the generated ONNX, the weights of the `QuantLinear` layer are clipped between -3 and 3, considering that we are performing a signed 3 bit quantization, with `narrow_range=True`.\n",
    "\n",
    "Similarly, the output of the QuantReLU is clipped between 0 and 15, since in this case we are doing an unsigned 4 bit quantization."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "### Clipping in QOps\n",
    "\n",
    "Even when using `QLinearConv` and `QLinearMatMul`, it is still possible to represent bit-width < 8 through the use of clipping.\n",
    "\n",
    "However, in this case the `Clip` operation over the weights won't be captured in the exported ONNX graph. Instead, it will be performed at export-time, and the clipped tensor will be exported in the ONNX graph.\n",
    "\n",
    "Examining the last exported model, it is possible to see that the weight tensor, even though it has Int8 has type, has a min/max values equal to `[-7, 7]`, given that it is quantized at 4 bit with narrow_range set to True.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## ONNX Runtime\n",
    "\n",
    "### QCDQ\n",
    "\n",
    "Since for QCDQ we are only using standard ONNX operation, it is possible to run the exported model using ONNX Runtime."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2025-05-09T14:50:06.283663Z",
     "iopub.status.busy": "2025-05-09T14:50:06.283199Z",
     "iopub.status.idle": "2025-05-09T14:50:07.086039Z",
     "shell.execute_reply": "2025-05-09T14:50:07.084014Z"
    },
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "True\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-05-09 15:50:06.990808285 [W:onnxruntime:, graph.cc:1283 Graph] Initializer linear.bias appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.\n"
     ]
    }
   ],
   "source": [
    "import onnxruntime as ort\n",
    "\n",
    "class Model(torch.nn.Module):\n",
    "    def __init__(self) -> None:\n",
    "        super().__init__()\n",
    "        self.linear = qnn.QuantLinear(IN_CH, OUT_CH, bias=True, weight_bit_width=3)\n",
    "        self.act = qnn.QuantReLU(bit_width=4)\n",
    "    \n",
    "    def forward(self, inp):\n",
    "        inp = self.linear(inp)\n",
    "        inp = self.act(inp)\n",
    "        return inp\n",
    "\n",
    "model = Model()\n",
    "model.eval()\n",
    "inp = torch.randn(BATCH_SIZE, IN_CH)\n",
    "path = 'quant_model_3b_4b_qcdq.onnx'\n",
    "\n",
    "exported_model = export_onnx_qcdq(model, args=inp, export_path=path, opset_version=13)\n",
    "\n",
    "sess_opt = ort.SessionOptions()\n",
    "sess = ort.InferenceSession(path, sess_opt)\n",
    "input_name = sess.get_inputs()[0].name\n",
    "pred_onx = sess.run(None, {input_name: inp.numpy()})[0]\n",
    "\n",
    "\n",
    "out_brevitas = model(inp)\n",
    "out_ort = torch.tensor(pred_onx)\n",
    "\n",
    "assert_with_message(torch.allclose(out_brevitas, out_ort))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "#### QGEMM vs GEMM\n",
    "\n",
    "QCDQ allows to execute low precision fake-quantization in ONNX Runtime, meaning operations actually happen among floating-point values. ONNX Runtime is also capable of optimizing and accelerating a QCDQ model leveraging a int8 based QGEMM kernels in some scenarios.\n",
    "\n",
    "This seems to happen only when using a `QuantLinear` layer, with the following requirements:\n",
    "- Input, Weight, Bias, and Output tensors must be quantized;\n",
    "- Bias tensor must be present, and quantized with bitwidth > 8.\n",
    "- The output of the QuantLinear must be re-quantized.\n",
    "- The output bit-width must be equal to 8.\n",
    "- The input bit-width must be equal to 8.\n",
    "- The weights bit-width can be <= 8.\n",
    "- The weights can be quantized per-tensor or per-channel.\n",
    "\n",
    "We did not observe a similar behavior for other operations such as `QuantConvNd`.\n",
    "\n",
    "An example of a layer that will match this definition is the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false,
    "execution": {
     "iopub.execute_input": "2025-05-09T14:50:07.094253Z",
     "iopub.status.busy": "2025-05-09T14:50:07.092574Z",
     "iopub.status.idle": "2025-05-09T14:50:07.134212Z",
     "shell.execute_reply": "2025-05-09T14:50:07.132360Z"
    },
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from brevitas.quant.scaled_int import Int32Bias\n",
    "from brevitas.quant.scaled_int import Int8ActPerTensorFloat\n",
    "\n",
    "qgemm_ort = qnn.QuantLinear(\n",
    "    IN_CH, OUT_CH,\n",
    "    weight_bit_width=5,\n",
    "    input_quant=Int8ActPerTensorFloat,\n",
    "    output_quant=Int8ActPerTensorFloat,\n",
    "    bias=True, bias_quant=Int32Bias)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Unfortunately ONNX Runtime does not provide a built-in way to log whether execution goes through unoptimized floating-point GEMM, or int8 QGEMM."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Export Dynamically Quantized Models to ONNX "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also export dynamically quantized models to ONNX, but there are some limitations. The ONNX DynamicQuantizeLinear requires the following settings:\n",
    "- Asymmetric quantization (and therefore *unsigned*)\n",
    "- Min-max scaling\n",
    "- Rounding to nearest\n",
    "- Per tensor scaling\n",
    "- Bit width set to 8\n",
    "\n",
    "This is shown in the following example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-05-09T14:50:07.141447Z",
     "iopub.status.busy": "2025-05-09T14:50:07.140768Z",
     "iopub.status.idle": "2025-05-09T14:50:07.913782Z",
     "shell.execute_reply": "2025-05-09T14:50:07.911672Z"
    }
   },
   "outputs": [],
   "source": [
    "from brevitas_examples.common.generative.quantizers import ShiftedUint8DynamicActPerTensorFloat\n",
    "\n",
    "IN_CH = 3\n",
    "IMG_SIZE = 128\n",
    "OUT_CH = 128\n",
    "BATCH_SIZE = 1\n",
    "\n",
    "class Model(torch.nn.Module):\n",
    "    def __init__(self) -> None:\n",
    "        super().__init__()\n",
    "        self.linear = qnn.QuantLinear(IN_CH, OUT_CH, bias=True, weight_bit_width=8, input_quant=ShiftedUint8DynamicActPerTensorFloat)\n",
    "        self.act = qnn.QuantReLU(input_quant=ShiftedUint8DynamicActPerTensorFloat)\n",
    "    \n",
    "    def forward(self, inp):\n",
    "        inp = self.linear(inp)\n",
    "        inp = self.act(inp)\n",
    "        return inp\n",
    "\n",
    "inp = torch.randn(BATCH_SIZE, IN_CH)\n",
    "model = Model() \n",
    "model.eval()\n",
    "path = 'dynamic_quant_model_qcdq.onnx'\n",
    "\n",
    "exported_model = export_onnx_qcdq(model, args=inp, export_path=path, opset_version=13)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2025-05-09T14:50:07.921255Z",
     "iopub.status.busy": "2025-05-09T14:50:07.920534Z",
     "iopub.status.idle": "2025-05-09T14:50:10.952025Z",
     "shell.execute_reply": "2025-05-09T14:50:10.949423Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Serving 'dynamic_quant_model_qcdq.onnx' at http://localhost:8086\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "\n",
       "        <iframe\n",
       "            width=\"100%\"\n",
       "            height=\"400\"\n",
       "            src=\"http://localhost:8086/\"\n",
       "            frameborder=\"0\"\n",
       "            allowfullscreen\n",
       "            \n",
       "        ></iframe>\n",
       "        "
      ],
      "text/plain": [
       "<IPython.lib.display.IFrame at 0x7f4a5336ee00>"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "show_netron(\"dynamic_quant_model_qcdq.onnx\", 8086)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  },
  "vscode": {
   "interpreter": {
    "hash": "b6e150ee02c45d2c3f896173a651a21b25567e05411969bcc0f3a62fa15a0a0b"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
